About the project

IODS stands for Introduction to Open Data Science - a MOOC course that teaches how to use open software tools and how to analyse openly available data sets.

We will learn to use the following tools: R, R Studio, R Markdown and GitHub (including GitHub Desktop). Other resources include DataCamp and Slack.

The topics discussed during the course are: Regression and model validation, Logistic regression, Clustering and classification, Dimensionality reduction techniques, and Multivariate statistical modelling.

At the core of the course are the exercises: weekly DataCamp and RStudio exercises (including peer reviews) as well as a final assignment for which one of the topics of the course will be investigated in more detail.

Here is the link to my GitHub repository.


Regression analysis

Overview of the data

The dataset used in this analysis was prepared beforehand. Originally, it consisted of 60 different variables and a total of 183 observations. Several subgroups of related variables were combined into single variables by calculating the row-wise mean of each subgroup. Then, observations where exam points were zero were excluded.
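The wrangling steps described above can be sketched in R roughly as follows (the data frame and question columns here are illustrative toy stand-ins, not the actual course data):

```r
# Toy stand-in: two hypothetical question columns forming one "deep" subgroup
d <- data.frame(q1 = c(4, 2, 5), q2 = c(3, 3, 4), points = c(25, 0, 12))
d$deep <- rowMeans(d[, c("q1", "q2")])  # combine a subgroup by its row-wise mean
d <- subset(d, points > 0)              # exclude observations with zero exam points
```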

## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.75 2.88 3.88 3.5 3.75 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data deals with different aspects of learning statistics and the students’ exam points. Variables ‘deep’, ‘stra’ and ‘surf’ are the combined variables mentioned earlier; they consist of questions related to the students’ learning objectives: deep learning, strategic learning and surface learning. Variable ‘attitude’ represents the student’s global attitude towards statistics. Variable ‘gender’ includes two values: female (‘F’) and male (‘M’), and variable ‘age’ is recorded as the student’s numerical age in years.

## [1] 166   7

Thus, the final dataset consists of seven variables and 166 observations. The following picture provides a visual overview of the data. It shows the variables in more detail, as well as the relationships between different variables, including their correlation coefficients.

Let us begin with ‘gender’. As can be seen below, there are nearly twice as many women (110) as there are men (56).

##   F   M 
## 110  56

Here is a summary of the age variable as a whole,

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    17.0    21.0    22.0    25.5    27.0    55.0

as well as for the subgroups separately (first male students, then female students).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    19.0    21.0    24.0    26.8    29.0    55.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    17.0    20.0    22.0    24.9    26.0    53.0

The overall ages range from late teens to mid-fifties, and the two subgroups are rather similar. The central 50% of students in both gender groups are in their twenties, and female students are on average younger than male students (male: 21-29 / mean 26.8, female: 20-26 / mean 24.9). Within this central 50%, the median values lie closer to the younger end (male: 24, female: 22). There are also many outliers at the older end of both subgroups.

It is perhaps not surprising that male students’ attitude towards statistics is slightly more positive, but here too, the ranges are wide and there are exceptions.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    17.0    31.0    34.0    34.4    39.0    48.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    14.0    25.0    29.5    29.9    35.8    50.0

The central 50% of observations for women falls between 25.0 and 35.8, whereas for men it is 31.0-39.0. Although the male group appears more uniform, there are outliers in this subgroup.

Both male and female students are in favour of deep learning. Male students’ exam points are somewhat higher, but female students tend to be a bit more consistent in their results, as the box is narrower - especially keeping in mind that there are twice as many students in this subgroup.

The strongest positive correlation (0.437) can be found between attitude and points, and it is stronger for men (0.451) than for women (0.422), which is also easy to see in the kernel density graphs. Attitude also has a very weak positive correlation with age overall (0.0222) - for women (0.000756), attitudes towards statistics become very slightly more positive with age, but the opposite holds for men (-0.0414).

Both deep learning and strategic learning have a positive correlation with age, while surface learning has a negative one. As one gets older, one takes studying more seriously while being more selective in the process. Unfortunately, and somewhat surprisingly, age has a slightly negative correlation with points.

Regression model

The next step is to try and fit a regression model, i.e. to try to explain the behaviour of the dependent variable (y) - in this case exam points. Explanatory variables were selected based on the findings in the previous section and the regression model can be stated as points ~ attitude + stra + age. Let us look at a summary of this model:

## 
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18.11  -3.20   0.33   3.41  10.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.8954     2.6483    4.11  6.2e-05 ***
## attitude      0.3481     0.0562    6.19  4.7e-09 ***
## stra          1.0037     0.5343    1.88    0.062 .  
## age          -0.0882     0.0530   -1.66    0.098 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.218,  Adjusted R-squared:  0.204 
## F-statistic: 15.1 on 3 and 162 DF,  p-value: 1.07e-08
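The model above is fitted with R’s lm(). A self-contained sketch (the data here is synthetic, so the coefficients will not match the summary above):

```r
set.seed(1)
# Synthetic stand-in for the learning2014 data (for illustration only)
learning2014 <- data.frame(attitude = rnorm(166, 32, 7),
                           stra     = runif(166, 1, 5),
                           age      = sample(17:55, 166, replace = TRUE))
learning2014$points <- 11 + 0.35 * learning2014$attitude + rnorm(166, sd = 5)

my_model <- lm(points ~ attitude + stra + age, data = learning2014)
summary(my_model)  # prints the coefficient table, R-squared and F statistic
```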

The summary reveals that attitude is in fact a valid predictor of students’ success in their exams. The null hypothesis for each coefficient is that beta equals zero. Here, the estimated value of beta for attitude is 0.3481 with a standard error of 0.0562. This means that for a one-unit increase in attitude, the exam points are expected to increase by 0.3481, holding the other variables constant. The p value of the t-test is very low (4.7e-09), which means that the rejection of the null hypothesis (beta = 0) is highly significant.

As for strategic learning and age, their p values are further away from zero and thus significant only at a significance level of 0.1. The beta value of stra is higher than that of age. As age has the highest p value, I will remove it from the model in order to see how this affects the model. Here is a summary of the resulting model lm(y ~ x1 + x2) where attitude and strategic learning are used as explanatory variables for exam points.

## 
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.644  -3.311   0.558   3.793  10.930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959    3.75  0.00025 ***
## attitude      0.3466     0.0565    6.13  6.3e-09 ***
## stra          0.9137     0.5345    1.71  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.29 on 163 degrees of freedom
## Multiple R-squared:  0.205,  Adjusted R-squared:  0.195 
## F-statistic:   21 on 2 and 163 DF,  p-value: 7.73e-09

Now that age has been removed, the beta values of both attitude and strategic learning have somewhat decreased (attitude: 0.3481 → 0.3466, stra: 1.0037 → 0.9137) while the standard errors remain next to unaltered (attitude: 0.0562 → 0.0565, stra: 0.5343 → 0.5345). The p values have also changed a little, but their interpretations are the same as in the first model.

Based on these test results for the individual variables, it looks as if the model now fits better, but there is still the model as a whole to consider. The multiple R-squared is a measure of the goodness of fit of the model. If the value is 1 (the maximum), the regression model fits the observations exactly and there are no residuals. Here, the multiple R-squared value is 0.205, which means that the model explains 20.5% of the variance in exam points. In the first model (with three explanatory variables) the multiple R-squared value was higher (0.218), which illustrates that adding explanatory variables never decreases the R-squared value. The adjusted R-squared accounts for this by penalizing the model for less significant variables, and will therefore be lower than the multiple R-squared value. Its value in the first model was 0.204 and, having removed one variable, its value is now 0.195.

Although the R-squared values for a model with two explanatory variables will be lower than for a model with three, they are very similar, and in both cases the models explain roughly 20% of the variance, which indicates that the variables in this dataset alone do not explain how well a student does in their exam. Should I venture a guess, I would say that the hours spent studying have a more decisive impact on the results.

Finally, consider the p values of the F test, which assesses the entire model instead of individual variables - in both cases the p value is extremely small. The critical values of F with (3, 162) and (2, 163) degrees of freedom are about 3.78 and 4.61 at the 1 percent significance level, so F statistics of 15.1 and 21 indicate a significant degree of explanation in both models.

Diagnostics

In the previous section, a regression model was fitted and we investigated how well it fitted the observations. The purpose of this section is to look at the model from the point of view of the residuals, which reveal the shortcomings of the model. They turn our attention to what was left over after the model was fitted and may reveal patterns that affect the fit. Here, I have chosen to investigate the latter model: points ~ attitude + stra.
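In R, the three diagnostic plots discussed below can be produced directly from a fitted lm object; a sketch using the built-in cars data as a stand-in:

```r
m <- lm(dist ~ speed, data = cars)  # stand-in for points ~ attitude + stra
plot(m, which = c(1, 2, 5))         # 1 = Residuals vs Fitted, 2 = Normal Q-Q,
                                    # 5 = Residuals vs Leverage
```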

Residuals vs Fitted values

Let us begin by looking at residuals versus fitted values, which will show whether the residuals have non-linear patterns or non-constant variance.

The residuals are rather nicely spread around the horizontal line and there are no distinct patterns. There are some observations (35, 56, 145) which lie further away from the line and would therefore be worth investigating more closely, but other than those, there do not appear to be any major issues with the model.

Normal QQ-plot

Errors (residuals) in the regression model are assumed to be normally distributed, and the QQ-plot shows if this is the case for the fitted model.

Again, the residuals do not fall exactly on the straight line, especially at both ends. In general the plot does not reveal any major concerns, although observations 35, 56 and 145 do stand out again.

Residuals vs Leverage

Finally, let us take a look at the residuals versus leverage. The following plot shows if there are influential observations, i.e. how much impact single observations have on the model.

In this plot, we are not looking for patterns, but rather for outlying values in the upper or lower right corners, as these would be the ones that have influence on the regression model. As we can see, there are none in this plot. Observations 35, 71 and 145 are marked, but as they lie within the Cook’s distance lines (which actually lie outside the plot area), they are not influential and the model appears to have a reasonably good fit.

However, the same numbered outliers appeared in all of these plots. The next step would be to take a closer look at them: they might be meaningful, or simply errors in the data.



Logistic regression

Overview of the data

The dataset used in this chapter contains combined information about the alcohol consumption of students taking both a mathematics and a Portuguese language course. The original datasets are available at the UCI Machine Learning Repository.

The combined dataset consists of 35 variables and 382 observations.

## [1] 382  35

In addition to alcohol consumption, the data provides a variety of background information for each student.

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The information provided by the variables can be summarised as follows:

  • personal
  • family, finance
  • allocation of time
  • social connections
  • attendance and academic success, and
  • alcohol consumption.

The original variables were complemented with two new variables:

  • “alc_use” is the average of alcohol consumption during workdays (“Dalc”) and weekends (“Walc”), and
  • “high_use” divides students into two groups based on whether or not their average alcohol consumption (“alc_use”) is high or low.
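A sketch of how the two derived variables can be computed (toy data; the high-use cutoff of alc_use > 2 is an assumption based on the course exercises):

```r
alc <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))  # 1-5 consumption scales
alc$alc_use  <- (alc$Dalc + alc$Walc) / 2  # average of workday and weekend use
alc$high_use <- alc$alc_use > 2            # TRUE when average consumption is high
```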

Selecting variables

In order to explain what determines the level of alcohol consumption (high/low), we need to consider the other variables and how they might be related to this issue.

I expect that what we are dealing with is a group of young adults beginning to experiment with alcohol. I believe that at this age, especially at the younger end, the key factors behind alcohol use are peer pressure and the amount of parental supervision. But it must also be kept in mind that the situation of a 15-year-old student can be very different from that of a 22-year-old student. For this reason, we must factor in other related variables which may not themselves cause the level of alcohol consumption but correlate with it, such as academic behaviour. Time spent studying and grades do not appear to be worthy factors, since they depend heavily on the student’s intelligence and only to a lesser extent on their alcohol consumption. On the other hand, time spent partying and drinking does result in fatigue (to say the least) and could therefore be reflected in attendance.

Based on this reasoning, I have selected the following four variables:

Factor Variable Description of variable
age age student’s age
peer pressure goout going out with friends
parental supervision famrel quality of family relations
attendance absences number of school absences

Let us next explore the selected variables beginning with age.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    15.0    16.0    17.0    16.6    17.0    22.0
## [1] 0.113

The age range is from 15 to 22 and the Pearson correlation between age and high_use is 0.113. Older students tend to use more alcohol (high_use = TRUE) than younger ones. It is interesting to note that for high consumption the median age is lower for girls than for boys. In fact, 50% of the girls consuming a lot of alcohol are 16-17 years old, whereas the boys’ range is slightly wider. Both sexes start experimenting around the same age, but girls “peak” earlier, so to speak, and maintain that level for a longer time.

Next we have time spent with friends, “goout”, where value 1 equals very low and 5 equals very high.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    2.00    3.00    3.11    4.00    5.00
## [1] 0.354

The figure above shows a clear difference between the consumption levels. Those in the high consumption group go out more often than those in the low consumption group, suggesting that peer pressure does indeed have an effect on alcohol consumption. A comparison between the two consumption levels reveals an interesting gender difference as well. Girls in the low consumption group go out more than boys, and even as much as those in the high consumption group. For boys it is either/or: those boys who do not go out a lot do not drink much, and vice versa. Thus, girls spend a lot of time with their friends, but alcohol consumption is not as important a factor in building their social relations as it is for adolescent boys. Also, the correlation is now 0.354, which is higher than for age.

Next, let us zoom out a bit and look at the quality of family relationships where the values of the variable “famrel” are like those of our previous variable (1 = very bad, …, 5 = excellent).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    4.00    3.94    5.00    5.00
## [1] -0.113

With this variable, there is a negative correlation of -0.113, meaning that the better the relationships within the family are, the less likely it is that the offspring will consume high levels of alcohol. But the difference is not very sharp, as the figure reveals. The majority of students at both consumption levels appear to have relatively good relationships with their families, and although low consumption corresponds a bit better with very good family relationships, the medians are surprisingly the same (4) for both levels.

Absences, on the other hand, are not quite as clearly differentiated as I had anticipated.

## [1] 0.223

The correlation value is 0.223, and the figure shows that absences are a little more frequent in the high consumption group. But the medians are almost identical, and girls have more outliers in both consumption groups. Any meaningful differences can again be seen by looking at the boys: low consumption corresponds to very few absences (fewer than for girls), whereas high consumption corresponds to more absences than for girls.

Logistic regression model

Now I have fitted a logistic regression model between the variables discussed in the previous section (the explanatory variables) and the level of alcohol consumption which is, as we have seen, a binary variable (high / low). Below is a summary of the fitted model.

## 
## Call:
## glm(formula = high_use ~ age + goout + famrel + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.804  -0.756  -0.533   0.890   2.382  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.9196     1.8543   -2.11  0.03453 *  
## age           0.0966     0.1096    0.88  0.37770    
## goout         0.7582     0.1206    6.29  3.2e-10 ***
## famrel       -0.3564     0.1356   -2.63  0.00857 ** 
## absences      0.0721     0.0218    3.30  0.00095 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 395.01  on 377  degrees of freedom
## AIC: 405
## 
## Number of Fisher Scoring iterations: 4
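The model above is fitted with glm() and family = "binomial". A runnable sketch with synthetic stand-in data (the estimates will therefore differ from the summary above):

```r
set.seed(2)
# Synthetic stand-in for the alc data (382 observations as in the text)
alc <- data.frame(age      = sample(15:22, 382, replace = TRUE),
                  goout    = sample(1:5, 382, replace = TRUE),
                  famrel   = sample(1:5, 382, replace = TRUE),
                  absences = rpois(382, 4))
alc$high_use <- runif(382) < plogis(-4 + 0.75 * alc$goout)

m <- glm(high_use ~ age + goout + famrel + absences,
         data = alc, family = "binomial")
summary(m)
```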

Looking at the summary, we can see that apart from age, the selected explanatory variables are significant. Time spent with friends and absences from school are the most significant ones, with p values very close to zero. Although age by itself does not score very high in this model, the model as a whole is significant at the 0.01 significance level.

Odds ratio and confidence intervals

Let us next look at the odds ratio (OR) and confidence intervals (CI) of the fitted model.

##                 OR    2.5 % 97.5 %
## (Intercept) 0.0198 0.000496  0.726
## age         1.1015 0.888842  1.367
## goout       2.1345 1.695118  2.723
## famrel      0.7002 0.535323  0.913
## absences    1.0747 1.030941  1.124
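The odds ratios above are the exponentiated model coefficients, and the intervals come from exponentiating confint(); a sketch with a toy logistic model on the built-in mtcars data:

```r
library(MASS)  # provides profile-likelihood confint() for glm objects
m  <- glm(am ~ wt, data = mtcars, family = "binomial")  # toy model
OR <- exp(coef(m))     # odds ratios
CI <- exp(confint(m))  # 95% confidence intervals on the odds-ratio scale
cbind(OR, CI)
```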

For each one-unit increase in time spent with friends (“goout”), the odds of a high level of alcohol consumption roughly double (OR = 2.1345), as opposed to a low level of consumption. This complies well with the model’s test results above. The odds ratios for absences and family relationships are also in line with the test results. Interestingly, age was not as significant a variable as the others, but here the odds ratio for age (1.1015) is actually higher than for absences (1.0747). However, the 95% confidence interval reveals that age is not statistically significant at the 0.05 level (because the interval contains 1.0), while the other variables are significant.

Prediction and penalty

In order to explore the predictive power of my model to its full extent, I have modified it by including only those variables which were found to be statistically significant, i.e. I have dropped the age variable. The logistic regression equation is now:
\(y = -2.3882 + 0.7707(goout) - 0.3499(famrel) + 0.0745(absences)\)

Based on this modified model, you will find below a 2x2 cross tabulation of the predictions versus actual values of the level of alcohol consumption (TRUE being ‘high’ and FALSE being ‘low’). The prediction threshold has been set to 0.5.

##         prediction
## high_use  FALSE   TRUE    Sum
##    FALSE 0.6387 0.0628 0.7016
##    TRUE  0.1806 0.1178 0.2984
##    Sum   0.8194 0.1806 1.0000

The model predicts a low level of alcohol consumption for about 82% of the students, but only about 64 percentage points of those predictions are correct. Correspondingly, it predicts a high level of consumption for about 18% of the students, of which only about 12 percentage points are correct.

We have now considered the model from the point of view of correctly predicted values, but we still need to find a measure for the accuracy of the model as a whole. This can be done by calculating the mean of incorrect predictions (penalty).

## [1] 0.243

So, the lower the penalty the better (no model will ever be completely accurate). The value here (0.243) seems reasonably low, and in any case it is lower than the value attained from the model used in the DataCamp exercises (0.256).
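A sketch of how the cross tabulation and the penalty are computed, using toy vectors in place of the model’s predicted probabilities:

```r
actual      <- c(TRUE, FALSE, FALSE, TRUE, FALSE)  # observed high_use
probability <- c(0.7, 0.2, 0.6, 0.4, 0.1)          # predicted probabilities
prediction  <- probability > 0.5                   # 0.5 threshold as in the text

table(actual, prediction) / length(actual)  # proportion cross tabulation
mean(actual != prediction)                  # penalty: share of wrong predictions
```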

Cross-validation (*)

Continuing with the modified model, I will next perform a 10-fold cross-validation in order to test its actual predictive power.

## [1] 0.249

Compared to the model used in DataCamp, it appears that my model performs better (< 0.262). Both models have three explanatory variables, but they have only one in common: ‘absences’. The difference in performance can only be explained by differences in the models. The DataCamp model considers academic underachievement (‘failures’) and gender (‘sex’), whereas I decided not to include academic performance due to its dependence on other factors (e.g. intelligence) and instead investigated the influence of peer pressure (‘goout’) and parental involvement (‘famrel’). It would appear that my line of reasoning has pointed me in the right direction.
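The 10-fold cross-validation can be done with cv.glm() from the boot package; a self-contained sketch with a toy model:

```r
library(boot)
set.seed(3)
d <- data.frame(x = rnorm(200))
d$y <- as.numeric(runif(200) < plogis(d$x))
m <- glm(y ~ x, data = d, family = "binomial")

# The same penalty as above: share of predictions on the wrong side of 0.5
cost <- function(actual, pred) mean(abs(actual - pred) > 0.5)
cv <- cv.glm(data = d, glmfit = m, cost = cost, K = 10)
cv$delta[1]  # average prediction error over the ten folds
```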

Exploring performance (**)

The question still remains: is my model, although modified, the best possible one? Since only four variables were originally chosen, there are still over 30 variables which have not been put to the test. In this section I will approach modelling from another perspective: by favouring statistical measures and iteration over personal reasoning.
N.B. I will use K-fold cross-validation and display its results, but model summaries will not be displayed for the sake of legibility.

Let us begin with a wide variety of possibly relevant variables.

## [1] 0.272

The error is now higher than in any of the earlier models but this was expected - some variables are better predictors than others.

The summary of this model revealed that there are more non-significant variables than significant ones. I will now drop all those variables which are not significant at the 0.05 level for any of their values.

## [1] 0.241

As we can see, there is a vast improvement and this model has given us the best result yet. There are still nine variables left, so surely not all of them are valuable for the model.

There are actually several which can be dropped: ‘reason’ (although one value is somewhat significant, the others really are not), ‘paid’, ‘freetime’ and ‘health’. This leaves us with a total of five remaining variables.

## [1] 0.23

Again, the model improved, but why stop here?

I will remove yet another variable (‘address’) which, although it is significant at 0.05, is the least significant. Now we are left with four variables: ‘studytime’, ‘goout’, ‘famrel’ and ‘absences’. Let us see what happens.

## [1] 0.238

As it turns out, the result is worse. Thus, the previous model was the best one. I am happy to note that it included all of the variables from my modified model (‘goout’, ‘famrel’, ‘absences’). On the other hand, my hypothesis about time spent studying not being a good measure was wrong, and I had not considered the effect of urban versus rural (‘address’) living at all which, of course, does determine one’s mobility and thus social environment.



Clustering and classification

Overview of the data

This week I am using a built-in dataset, namely the Boston dataset, which contains housing information from the Boston, Massachusetts area. The data was collected by the U.S. Census Service and is also available online.

The Boston dataset consists of 506 observations and 14 variables. Nearly all of the variables are continuous (chas is a binary dummy variable). Let us take a closer look at what they are.

The variable names are not exactly self-explanatory. Because our analysis will again depend upon them, I will briefly list them here with a short description.

Variable Description
crim crime rate (per capita)
zn proportion of residential land zoned for lots over 25,000 sq.ft.
indus proportion of non-retail business acres
chas Charles River dummy variable (1 if tract bounds river; 0 otherwise)
nox nitric oxides concentration
rm average number of rooms per dwelling
age proportion of owner-occupied units built prior to 1940
dis weighted distances to five Boston employment centres
rad index of accessibility to radial highways
tax full-value property-tax rate per $10,000
ptratio pupil-teacher ratio
black proportion of blacks
lstat % lower status of the population
medv median value of owner-occupied homes in $1000’s

Thus, the Boston dataset consists of a variety of information: demographic, economic, and environmental factors as well as safety. I will not discuss the variables in more detail just now, but save it for later when we explore the standardized dataset - the comparison between the original and scaled variables will be more meaningful if the summaries are presented together.

So, let us next take a graphical tour of the data. In the figure below, each variable is plotted against the other variables.

Here we see some interesting distribution patterns, such as
* accumulation close to the edges and corners of the box - e.g. age/lstat
* more curved shapes close to the lower left corner - e.g. nox/dis
* compact round shapes close to one of the corners - e.g. rad/tax
* binary positions in all pairs for the variables chas and rad.

Another way of approaching the relationships between variables is to look at their correlations. Below is a correlogram where, instead of numerical values, the correlations are presented as coloured spots. Red indicates a negative correlation and blue a positive correlation. The stronger the correlation, the bigger the spot.

Three pairs stand out as having a strong negative correlation: indus/dis, nox/dis and lstat/medv. It is interesting to note the connection between this and the previous graph. If we go back to the previous graph, we can see that these pairs are those which showed curved patterns.
As for positive correlations, there are many to choose from but one stands out from the rest: rad/tax. This was also identified above and it is the one with a compact round shape.

Standardization

The first step before attempting actual classification is to standardize the dataset. This is because we need the variables both to be comparable and to fit the assumptions on which the classification method is based (the variables are normally distributed and share the same variance).

Standardization is a simple operation in which the values of each variable are calculated using the following formula:
\(scaled(x) = [x-mean(x)]/sd(x)\)
i.e. the variable mean is subtracted from the original value of the variable and the resulting difference is then divided by the standard deviation of the variable.
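In R this is exactly what scale() does column-wise; a quick check with a few of the original age values:

```r
x <- c(2.9, 45, 77.5, 94.1, 100)          # a handful of the original age values
scaled_x <- (x - mean(x)) / sd(x)         # the formula above
all.equal(scaled_x, as.vector(scale(x)))  # scale() gives the same result
```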

Now let us take a closer look at a selection of these scaled variables and compare them to the original, non-scaled, dataset (original values are shown first and then the scaled values). Here are summaries of age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.9    45.0    77.5    68.6    94.1   100.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -2.330  -0.837   0.317   0.000   0.906   1.120

tax (full value property tax rate)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     187     279     330     408     666     711
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -1.310  -0.767  -0.464   0.000   1.530   1.800

and black (proportion of blacks in the area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0     375     391     357     396     397
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -3.90    0.20    0.38    0.00    0.43    0.44

The standardization procedure has affected the variables considerably. To begin with, the range has been radically reduced: for age [2.9, 100.0] -> [-2.333, 1.116], for tax [187, 711] -> [-1.313, 1.796], and for black [0, 397] -> [-3.90, 0.44]. What is most noteworthy, though, is the change in the mean values - in the scaled dataset, the mean value is 0.000 for all variables. The variables are now aligned and thus more easily comparable.

I will use this scaled dataset from this point onwards. There are still some minor adjustments to be made. In the analysis to come, I will use crime rate as the target variable, and in order to do so the continuous variable crim needs to be converted into a new, categorical variable crime (after which the original variable will be removed, because all the other variables will serve as the explanatory variables). The quantiles of the scaled variable crim

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -0.42   -0.41   -0.39    0.00    0.01    9.92

will serve as the breakpoints for the new categorical variable crime - the table below shows its distribution between the newly created categories.

## crime
##      low  med_low med_high     high 
##      127      126      126      127
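This conversion can be sketched with the Boston data from the MASS package, with quantile() providing the breakpoints for cut():

```r
library(MASS)
boston_scaled <- as.data.frame(scale(Boston))
bins  <- quantile(boston_scaled$crim)  # min, quartiles and max as breakpoints
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crime <- crime
boston_scaled$crim  <- NULL  # drop the original continuous variable
table(crime)
```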

Creating a model from your data is interesting by itself, but it falls short if there is nothing to test it on. Luckily, we can divide the dataset into training and test sets where the former is used for creating the model and the latter is used as new data for testing the model’s prediction power. I will use 80% of my data for training and the remaining 20% for testing.
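An 80/20 split can be sketched by sampling row indices (shown here on the built-in iris data; in the analysis n = 506):

```r
set.seed(4)
n   <- nrow(iris)
ind <- sample(n, size = n * 0.8)  # 80% of row indices, without replacement
train <- iris[ind, ]
test  <- iris[-ind, ]
```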

Linear discriminant analysis

Now that the data has been prepared and divided, we can fit the linear discriminant analysis (LDA) on the previously created training set. LDA is a classification method where the classes are known beforehand. In our case they are the previously created four categories denoting different crime levels.

LDA finds a linear combination of the explanatory variables which best characterizes or separates the different classes of the target variable. In this current case, the categorical variable crime is the target variable and all other variables are the explanatory, or predictor, variables. Here is a scatterplot of the LDA where different colours represent the different classes. The length and direction of the arrows show the impact that each of the predictor variables has in the model.
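Fitting the LDA and predicting classes can be sketched as follows (iris used as a stand-in; in the analysis the formula is crime ~ . on the scaled Boston training set):

```r
library(MASS)
fit  <- lda(Species ~ ., data = iris)  # classes are known beforehand
pred <- predict(fit, newdata = iris)   # in the analysis: newdata = test
table(correct = iris$Species, predicted = pred$class)  # cross tabulation
```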

It appears that accessibility to radial highways (rad) is associated with a high crime rate (in blue), and this relationship is more prominent than those between the other predictors and the lower crime rates. Although the medium high crime rate (in green) is almost completely separated from the high crime rate, it is a somewhat more uniform class than the lower crime rates (in black and red) and has an association with environmental factors (nox).

Prediction

Next, we can use the test set for predicting the classes in this new dataset and thus assess the model’s performance. Here is a cross tabulation of the actual vs predicted values.

##           predicted
## correct    low med_low med_high high Sum
##   low       13       8        2    0  23
##   med_low    4      19        4    0  27
##   med_high   0       5       20    0  25
##   high       0       0        0   27  27
##   Sum       17      32       26   27 102

Let us begin with the high crime rate: all predictions for this class are correct ones. Medium high crime rate is also most often predicted correctly. On the other hand, the predictions for low and medium low crime rates are not as accurate. These results confirm what we saw above in connection with the biplot.

K-means clustering

The aim is to create clusters (i.e. groups) of the observations. This can be done by adopting the k-means method which assigns the observations into k clusters based on the similarity of the objects. Similarity can in turn be measured in terms of distances. The following analysis is based on the Euclidean measure. Below is a summary of those distances.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.13    3.46    4.82    4.91    6.19   14.40

Now we can run the k-means algorithm. Because I do not yet know how many clusters I should use, I will begin by setting the number at 10, which should be a relatively safe bet.

Fortunately, we do not need to rely on guessing, as there are more sophisticated ways of finding the optimal number of clusters. Here, I will use the total within-cluster sum of squares (WCSS). WCSS measures how close the observations in each cluster are to the cluster centre, and the optimal number of clusters is the point where the value drops sharply. In the line graph below we can see that the drop occurs where the number of clusters is about two.
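A sketch of the WCSS search and a two-cluster fit on the scaled Boston data:

```r
library(MASS)
set.seed(5)
bs <- scale(Boston)
# Total within-cluster sum of squares for k = 1..10
twcss <- sapply(1:10,
                function(k) kmeans(bs, centers = k, iter.max = 100)$tot.withinss)
plot(1:10, twcss, type = "b")  # the sharp drop ("elbow") suggests k
km <- kmeans(bs, centers = 2)  # final clustering with two clusters
```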

I will now run the k-means algorithm again with the number of clusters set at two. The figure below is similar to the representation of pairs we saw earlier, but here it is the scaled variables that are plotted against each other. Also, instead of the original values, the patterns seen here represent the similarities (or differences), i.e. distances, between the variables.

Disregarding the colours for a brief moment, the overall distribution patterns appear nearly identical to those we saw earlier. However, now that the clusters are included and shown in different colours, we can see how the patterns relate to the clusters. The differences between the clusters are not always clear, but for many pairs they reveal certain unique, or even binary, patterns. The differences are most clear for all instances of the target variable crim (crime rate) and chas (the dummy variable), but also for rad (accessibility to radial highways) and tax (full-value property-tax rate).

3D plot

We were also given the code for producing a 3D plot. I did not have time to study it in detail, but I absolutely had to see what it looks like. So here it is!
